Bump batched MoE BLOCK_M from 16 to 64 on top of persistent kernel#19221
Draft
Gasoonjia wants to merge 1 commit into persistent-on-hoist from
Conversation
Microbenchmarked on Qwen3.5 MoE prefill (M=1696, top_k=8, 256 experts):
BLOCK_M=16: 3.62 ms
BLOCK_M=32: 2.85 ms (1.27x)
BLOCK_M=64: 2.75 ms (1.32x)
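For reproduction, a minimal sketch of how a microbenchmark like this can be timed with CUDA events; `kernel` stands in for whatever callable launches the batched MoE kernel (the signature used here is an assumption, not this PR's actual API), and K=4096 is an assumed hidden size:

```python
import torch

def time_moe_config(kernel, block_m, iters=100):
    """CUDA-event timing for one BLOCK_M config.

    `kernel` is a hypothetical stand-in for the batched MoE launch;
    its signature here is an assumption, not this PR's actual API.
    """
    M, top_k, E, K = 1696, 8, 256, 4096   # M/top_k/E from above; K is assumed
    x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
    topk_ids = torch.randint(0, E, (M, top_k), device="cuda", dtype=torch.int32)

    for _ in range(10):                   # warm-up: exclude compile/cache time
        kernel(x, topk_ids, BLOCK_M=block_m)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        kernel(x, topk_ids, BLOCK_M=block_m)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean latency, ms

# e.g. for bm in (16, 32, 64): print(bm, time_moe_config(my_batched_moe, bm))
```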
E2E (Qwen3.5-35B-A3B prefill, --moe-activation-dtype int8 --dense-prefill dequant --cuda_graph, p=1600 d=512, run_1..5 median):
BLOCK_M=16: 5897 tok/s prefill (273 ms), 98.1 tok/s decode
BLOCK_M=64: 6793 tok/s prefill (237 ms), 98.1 tok/s decode
Speedup: 1.152x prefill; decode unchanged (decode uses the non-batched fused_moe kernel).
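For context, BLOCK_M is the tile height: the number of rows of the flattened token batch that each Triton program instance processes. A schematic kernel illustrating the parameter (not this PR's actual kernel):

```python
import triton
import triton.language as tl

@triton.jit
def tile_copy_kernel(x_ptr, out_ptr, M, K,
                     BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance owns BLOCK_M rows of the (M, K) matrix and
    # walks across K in BLOCK_K chunks. Bumping BLOCK_M 16 -> 64 means
    # fewer, larger tiles: good at large prefill M (M=1696 above), but
    # mostly padding at the tiny per-step M seen during decode.
    pid = tl.program_id(0)
    rows = pid * BLOCK_M + tl.arange(0, BLOCK_M)
    for k0 in range(0, K, BLOCK_K):
        cols = k0 + tl.arange(0, BLOCK_K)
        mask = (rows[:, None] < M) & (cols[None, :] < K)
        offs = rows[:, None] * K + cols[None, :]
        tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask, other=0.0), mask=mask)

# 1-D grid of ceil(M / BLOCK_M) programs:
# tile_copy_kernel[(triton.cdiv(M, 64),)](x, out, M, K, BLOCK_M=64, BLOCK_K=128)
```

Larger tiles amortize scheduling overhead and map better onto tensor cores when M is large, which is consistent with the gain showing up in prefill while decode stays on the non-batched path.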
Outputs are bit-identical between BLOCK_M=16 and BLOCK_M=64 in the microbenchmark (max abs diff = 0).
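The bit-identity claim can be checked directly; a sketch (not this PR's test code), with `out16`/`out64` standing for the kernel outputs at the two tile sizes:

```python
import torch

def check_bit_identical(out16: torch.Tensor, out64: torch.Tensor):
    # "Bit-identical" means exact equality, not merely within tolerance.
    print("max abs diff =", (out16 - out64).abs().max().item())
    assert torch.equal(out16, out64), "BLOCK_M=16 vs 64 outputs differ"
```

Exact equality across tile sizes is plausible here because changing BLOCK_M only re-tiles rows; it does not reorder the K-reduction within a row.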